Project Background

Authors

Matthew Pannell - Matt is from Northern Virginia, specifically Springfield, Virginia. His favorite place to eat in Blacksburg is either Cabo Fish Taco or Zeppoli’s. In his free time, he enjoys playing video games and being outside. His favorite video game is Civilization and his favorite hike in Blacksburg is Bald Knob. Responsible for the Multiple Regression, kNN, and Naive Bayes analyses, and for formatting the dashboard.

Jason Bruno Teceros - Jason is from Herndon, VA, which is in the Northern Virginia area. He is not too far from where I grew up. He says his favorite dish is the Chicken and Lamb over Rice (mixed lamb over rice) from Italianos. In his free time he enjoys playing video games and reading. His favorite game is Call of Duty and his favorite author is Stephen King. He wishes he could be more active and participate in sports like he did in high school. Responsible for Ridge Regression, Logistic Regression, and the Project Background page of the dashboard, and helped with general dashboard formatting.

Porter Lin - I am from China. Currently my favorite place to eat is Perry Place, which is near the DDS building. My next favorite is Chipotle. My favorite music genre is classical, but recently I began picking up other genres as well. I don’t have one specific favorite game since I often play awesome games by small indie creators, but of all the games I play, I play Minecraft the most. I did the LOESS regression and helped with formatting the dashboard.

Abstract

This dashboard analyzes global development trends from 2000 to 2022 using World Bank indicators across many countries and a range of socioeconomic and infrastructure-related variables. Our analysis spans GDP per capita, income inequality (Gini index), unemployment, and other indicators that are discussed throughout the dashboard. We investigate the data through several research questions using both regression and classification techniques: Multiple Regression (non-linear), Ridge Regression, LOESS, kNN classification, Naive Bayes classification, and Logistic Regression. Our results are visualized using a variety of maps and plots that offer clear insight into global patterns and, by comparison, answer our respective research questions.

Introduction

This project leverages socioeconomic and infrastructure indicators from the World Bank, spanning the years 2000 to 2022, to investigate global development patterns across a diverse set of countries. The dataset includes economic measures such as GDP per capita, inflation, and unemployment, as well as social indicators like life expectancy, school enrollment, and health expenditure. Additionally, it incorporates infrastructure and technology access metrics, such as electricity access, internet usage, and mobile phone subscriptions.

Using this data, we explore research questions related to economic growth. Our methods include ridge regression to understand linear relationships, LOESS smoothing to uncover non-linear trends over time, and logistic regression for classification tasks, with a particular focus on modeling digital inclusion. The dashboard offers a visual and interactive means of understanding these complex, multidimensional relationships. The data used can be found here: https://drive.google.com/drive/u/1/folders/16j7E2yBUDPmfGM00o7rGVGaCsE9BYhnK. Additionally, you can download the data yourself from the World Bank website here: https://databank.worldbank.org/source/world-development-indicators.

Data Dictionary

Variable Name                                        Description
Country Name                                         Full name of the country
Country Code                                         Three-letter ISO country code
Year                                                 Calendar year of the observation
GDP per capita (current US$)                         Gross Domestic Product divided by midyear population, in current U.S. dollars
Gini index                                           Measure of income inequality (0 = perfect equality, 100 = perfect inequality)
Unemployment, total (% of total labor force)         Percentage of total labor force that is unemployed (national estimate)
Inflation, consumer prices (annual %)                Annual percentage change in consumer prices
Exports of goods and services (% of GDP)             Total exports as a percentage of GDP
Gross capital formation (% of GDP)                   Investment in fixed assets plus net changes in inventories
Life expectancy at birth, total (years)              Average number of years a newborn is expected to live
School enrollment, tertiary (% gross)                Gross enrollment ratio in tertiary education
Current health expenditure per capita (current US$)  Per capita expenditure on healthcare, in current U.S. dollars
Population growth (annual %)                         Annual population growth rate
Access to electricity (% of population)              Percentage of the population with access to electricity
Individuals using the Internet (% of population)     Percentage of individuals who use the Internet
Mobile cellular subscriptions (per 100 people)       Number of mobile subscriptions per 100 people
Urban population (% of total population)             Percentage of total population living in urban areas

Multiple Regression

Research Question

Can we predict a country's GDP per capita in USD from its total enrollment in tertiary education, how much it spends on healthcare per capita, and the percentage of its population that lives in an urban area?

Original Data


Looking at the matrix plot, the strongest relationship with GDP per capita is clearly health expenditure—there’s a tight, upward trend in the scatterplot, and the correlation is really high at 0.944. School enrollment and urban population also show positive relationships with GDP, but they’re not as strong. The points are more spread out and the correlations are a lot lower.

The distributions for GDP and health expenditure are both heavily skewed, with a lot of values clustered on the low end and a long tail of high values. That suggests we should log-transform those two variables to make the relationships more linear and improve the model fit. Urban population is a percentage and only mildly skewed, but the scatterplot with GDP still shows some curvature. In that case, a square root transformation is a better choice than log since it’s more appropriate for percentage-based variables and makes the relationship more linear without distorting the scale too much.
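The transformations described above can be sketched in a few lines. This is a minimal Python illustration rather than the dashboard's own code, which is not shown here. The function name `transform_row` and the sample values are hypothetical, and the natural logarithm is assumed because the fitted values reported later fall between roughly 9 and 11.

```python
import math

def transform_row(gdp, health, urban_pct, school):
    """Natural log for the two skewed dollar amounts, square root for the
    urban percentage; school enrollment is left on its original scale."""
    return {
        "log_gdp": math.log(gdp),
        "log_health": math.log(health),
        "sqrt_urban": math.sqrt(urban_pct),
        "school": school,
    }

# Hypothetical example values, not taken from the World Bank data.
row = transform_row(gdp=10000.0, health=500.0, urban_pct=64.0, school=40.0)
print(row)
```

The log compresses the long right tail of the dollar-valued variables, while the square root is a milder correction suited to a variable bounded between 0 and 100.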

Transformed Data


After applying the transformations, the relationships between the variables look a lot more linear. The log transformation on GDP and health expenditure really helped tighten up the scatterplots, especially between log_gdp and log_health, which now shows an extremely strong, nearly perfect linear trend (Corr: 0.982). The correlations between GDP and the other two predictors, school enrollment and urban population, also improved. School is now at 0.602, and the transformed urban variable (sqrt_urban) is at 0.722.

The density plots also look a lot better. log_gdp and log_health are now closer to normally distributed, and while sqrt_urban is still a little skewed, it’s a definite improvement over the original scale. The scatterplots overall look tighter and more consistent, which suggests the linearity assumption for multiple regression is much more reasonable on the transformed scale. The remaining assumptions are checked with the diagnostic plots below.

Fit Model


Call:
lm(formula = log_gdp ~ school + log_health + sqrt_urban, data = train_mlr)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.21562 -0.13358 -0.01252  0.14266  0.98148 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.8304094  0.0699193  54.783   <2e-16 ***
school      -0.0003958  0.0005025  -0.788    0.431    
log_health   0.8383025  0.0093836  89.337   <2e-16 ***
sqrt_urban  -0.0082864  0.0119256  -0.695    0.487    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2519 on 780 degrees of freedom
Multiple R-squared:  0.9637,    Adjusted R-squared:  0.9636 
F-statistic:  6901 on 3 and 780 DF,  p-value: < 2.2e-16

The model explains about 96% of the variation in \(\log\) GDP per capita, which means the three predictors together do a very good job of accounting for differences in GDP across countries. The F-test result shows that the overall model is statistically significant.

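Looking at the coefficients, log_health has a strong and statistically significant positive effect on log GDP (estimate 0.838, p < 2e-16). The coefficients for school (-0.0004, p = 0.431) and sqrt_urban (-0.0083, p = 0.487) are both close to zero and not statistically significant, suggesting that neither adds much to this model once health expenditure is included.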
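Because both GDP and health expenditure are log-transformed, the log_health coefficient can be read as an elasticity. The sketch below is a Python illustration (the helper name `gdp_ratio` is hypothetical); it uses the estimate from the model summary and assumes the other predictors are held fixed.

```python
B_LOG_HEALTH = 0.8383025  # estimate for log_health from the model summary

def gdp_ratio(health_ratio, beta=B_LOG_HEALTH):
    """Multiplicative change in predicted GDP per capita when health
    expenditure is multiplied by health_ratio, other predictors fixed."""
    return health_ratio ** beta

pct = (gdp_ratio(1.10) - 1) * 100
print(f"10% more health spending -> about {pct:.1f}% higher predicted GDP per capita")
```

This describes an association in the fitted model, not a causal effect of health spending on GDP.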

Studentized Residual Plot


The residuals are generally well spread out across the range of fitted values, which suggests that the model meets the assumption of constant variance. There’s no clear curvature or funnel shape, so the linearity and homoscedasticity assumptions appear to hold. There is definitely an outlier in the bottom left.

That said, there is some bunching of points between fitted values of 9 and 11, where the residuals seem to cluster more tightly around zero. This might reflect a large number of countries with similar predicted GDP values that the model was able to predict very accurately.

Leverage Plots


The horizontal dashed line represents the leverage cutoff, calculated as \(3(p/n)\). Many of the points are below the cutoff. However, some points exceed the line and have the potential to pull the regression line toward them. Further research would be needed to determine whether these points are accurate.

Cook’s Distance


The horizontal dashed line represents the \(4/(n-p-1)\) cutoff value. Most of the spikes sit below or around the cutoff line, but a few exceed it. Two points, at about 0.15 and about 0.07, stand out as high-influence points. Further investigation would be needed to see whether these points are accurate or important for our model.
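The two cutoffs used in the leverage and Cook's distance plots can be computed directly from the model summary. The sketch below is a Python illustration with hypothetical function names. It assumes that \(p\) counts the three predictors, an assumption because the text does not say whether the intercept is included, and it recovers \(n\) from the 780 residual degrees of freedom.

```python
RESIDUAL_DF = 780    # residual degrees of freedom from the model summary
N_PREDICTORS = 3     # school, log_health, sqrt_urban (assumed meaning of p)

# n - p - 1 = 780, so the training set has 784 rows.
n = RESIDUAL_DF + N_PREDICTORS + 1

def leverage_cutoff(n, p):
    """Leverage cutoff 3(p/n)."""
    return 3 * p / n

def cooks_cutoff(n, p):
    """Cook's distance cutoff 4/(n - p - 1)."""
    return 4 / (n - p - 1)

print(f"n = {n}")
print(f"leverage cutoff = {leverage_cutoff(n, N_PREDICTORS):.4f}")
print(f"Cook's distance cutoff = {cooks_cutoff(n, N_PREDICTORS):.4f}")
```

Under these assumptions the Cook's distance cutoff is about 0.005, so the two observations near 0.15 and 0.07 exceed it by more than an order of magnitude.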

QQ Plot


The studentized residuals follow the 45-degree line of the QQ plot pretty closely. The points lie right along the line in the middle of the plot and deviate a little in the tails, but overall the plot looks good and the residuals are consistent with the normality assumption.

Fit Testing Data

In order to test our multiple regression model, we split our data into a 70-30 training-testing split.


Model Error on Test Set
Metric Value
Root Mean Squared Error (RMSE) 0.2351
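A 70-30 split and the RMSE metric can be sketched as follows. This is a Python illustration with hypothetical helper names and a made-up seed; the dashboard's actual split procedure is not shown.

```python
import math
import random

def train_test_split(rows, train_frac=0.7, seed=42):
    """Shuffle the rows and cut them into training and testing parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(train_frac * len(rows)))
    return rows[:cut], rows[cut:]

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    sq = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sq / len(actual))

train, test = train_test_split(range(100))
print(len(train), len(test))
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

If the reported RMSE of 0.2351 is measured on the log scale, which seems likely because the response is log_gdp, it corresponds to a typical multiplicative error of roughly exp(0.2351) ≈ 1.27 in GDP per capita.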

Conclusion

Yes, we can predict a country's GDP per capita from its total enrollment in tertiary education, how much it spends on healthcare per capita, and the percentage of its population that lives in an urban area. The model has a very high \(R^2\) (about 0.96) and a low test RMSE (0.2351), although nearly all of the predictive power comes from health expenditure, since the school and urban coefficients were not statistically significant. Overall the linear regression assumptions were upheld, and linear regression appears adequate for this task. Further work could use robust regression methods to deal with the high-leverage and influential points.

Ridge Regression

Research Question and Background

To what extent can a country’s GDP per capita be predicted by its inflation rate, unemployment, export activity and capital formation?

Why use Ridge: We have multiple potentially correlated predictors (e.g., inflation and unemployment), so Ridge helps control for multicollinearity while identifying which economic indicators best explain variation in GDP per capita.

Choose Best Lambda

Ridge Regression Model


Conclusion

In the cross-validation plot, the red dots show the mean squared error at different values of log(lambda). The blue vertical line marks the optimal lambda, the value that minimizes the cross-validated prediction error. The model performs best with a relatively low regularization penalty, indicating that the predictors contribute useful information and that multicollinearity is present but not severe. The curve is smooth and U-shaped, showing the trade-off between underfitting and overfitting.
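The idea of choosing lambda by cross-validation can be sketched with a deliberately small example. The Python code below is an illustration only: it fits a one-predictor ridge model in closed form on made-up data, whereas the dashboard's model uses four predictors and its own (unshown) code. All function names and the lambda grid are hypothetical.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Ridge fit for one centered predictor: slope = Sxy / (Sxx + lambda)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / (sxx + lam)
    return my - beta * mx, beta  # (intercept, slope)

def cv_mse(xs, ys, lam, k=5):
    """k-fold cross-validated mean squared error for a given lambda."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        tr_x = [x for i, x in enumerate(xs) if i not in test_idx]
        tr_y = [y for i, y in enumerate(ys) if i not in test_idx]
        a, b = fit_ridge_1d(tr_x, tr_y, lam)
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test_idx)
    return total / n

# Made-up data: a noisy straight line, not the World Bank indicators.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

lambdas = [10 ** e for e in range(-3, 4)]
scores = {lam: cv_mse(xs, ys, lam) for lam in lambdas}
best = min(scores, key=scores.get)
print("best lambda:", best)
```

Larger values of lambda shrink the slope toward zero, so the cross-validated error eventually rises when the penalty becomes too strong.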

After log-transforming GDP per capita, the Ridge Regression model shows a more stable and linear relationship with our data. The predictions are close to the actual values and suggest that the transformation corrected the skewness we had previously. The model captures the overall trend well.

LOESS

Research Question and Background

We aim to use LOESS to model GDP per capita over time for several countries. We fit both degree-1 and degree-2 local polynomials and compare how good each fit is by comparing their MSEs. A lower MSE indicates a better fit, but we need to guard against overfitting. In all analyses, the span is set to 0.5.

We chose 6 countries for the analyses. The full data range is from 2000 to 2021, but some countries may have data missing for some years.
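To make the method concrete, here is a minimal Python sketch of a degree-1 LOESS fit with tricube weights and a span of 0.5. It is a simplified illustration and is not guaranteed to reproduce the numbers from the dashboard's own LOESS code; the function names and the toy series are hypothetical. The MSE is computed as SSE / n, which is consistent with the tables in this section (for example, 26000456 / 22 ≈ 1181838.9 for the United Kingdom).

```python
import math

def tricube(u):
    """Tricube kernel: weight falls smoothly to 0 at the bandwidth edge."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def loess_fit(xs, ys, span=0.5):
    """Degree-1 local regression evaluated at every observed x."""
    n = len(xs)
    q = max(3, math.ceil(span * n))          # points in each neighborhood
    fitted = []
    for x0 in xs:
        h = sorted(abs(x - x0) for x in xs)[q - 1]   # local bandwidth
        w = [tricube((x - x0) / h) for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        fitted.append(my + (sxy / sxx) * (x0 - mx))  # weighted line at x0
    return fitted

def sse_mse(ys, fitted):
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return sse, sse / len(ys)

# Toy series: a made-up straight line over 22 years, not real GDP data.
years = list(range(2000, 2022))
gdp = [1000.0 + 150.0 * (yr - 2000) for yr in years]
sse, mse = sse_mse(gdp, loess_fit(years, gdp))
print(sse, mse)
```

On a perfectly linear series the degree-1 fit reproduces the data, so any nonzero MSE on real data reflects curvature or year-to-year fluctuation that the local lines cannot follow.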

United Kingdom

Summary Statistics
degree SSE MSE
1 26000456 1181838.9
2 10837696 492622.5

Uruguay

Summary Statistics
degree SSE MSE
1 13101.512 595.5233
2 8579.084 389.9584

China

Summary Statistics
degree SSE MSE
1 176695.5 13591.958
2 63067.9 4851.377

France

Summary Statistics
degree SSE MSE
1 16384100 744732.0
2 61.72062 2.805483

Indonesia

Summary Statistics
degree SSE MSE
1 747.0884 33.95856
2 10663.7922 484.71783

Colombia

Summary Statistics
degree SSE MSE
1 127493.15 6710.166
2 17296.14 910.323

Conclusion

From the analyses above, we can see that LOESS tends to have a greater MSE when there is more fluctuation in a country’s growth, as in the case of the UK. For Uruguay and China, which showed more gradual growth, the MSE is lower. Because MSE is measured in squared dollars, part of this difference also reflects the overall level of each country's GDP per capita. Usually, we expect a degree-2 fit to give a lower MSE than a degree-1 fit, and this holds for five of the six countries. Indonesia is the exception, where the degree-2 LOESS yielded a higher MSE than degree-1. France shows the opposite extreme: its degree-2 MSE (about 2.8) is dramatically lower than its degree-1 MSE (about 744732), which may be a sign of overfitting. The analyses here use a span of 0.5, but we should try different span values to find the best fit for each country. While a lower MSE indicates a good fit to the observed data, we must also consider the problem of overfitting, so that when we introduce new data, LOESS still captures most of its features.

kNN Classification

Research Question

Can we classify countries into high vs. low internet usage based on economic indicators such as a country's GDP per capita, total enrollment in tertiary education, how much it spends on healthcare per capita, and the percentage of its population that lives in an urban area?

Set Up

For this research question we will create a binary variable based on whether internet usage is greater than the median internet usage value. We will also prepare our data using min-max normalization. Lastly, we will divide the data into a 70-30 training-testing split.
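The two preparation steps, labeling by the median and min-max normalization, can be sketched as follows. This is a Python illustration with hypothetical helper names and made-up values, not the dashboard's own code.

```python
import statistics

def label_by_median(values):
    """'High' if a value is above the median of all values, else 'Low'."""
    med = statistics.median(values)
    return ["High" if v > med else "Low" for v in values]

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up internet usage percentages, for illustration only.
usage = [5.0, 20.0, 45.0, 80.0]
print(label_by_median(usage))
print(min_max([10.0, 20.0, 30.0]))
```

Normalization matters for kNN because the distance calculation would otherwise be dominated by variables with large numeric ranges, such as GDP per capita in dollars.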

kNN Model

We will use the knn function from the class library. Additionally, we will test k values 1 through 10, to see which performs best.


k Accuracy Error
1 0.7515 0.2485
2 0.7784 0.2216
3 0.7874 0.2126
4 0.7844 0.2156
5 0.7874 0.2126
6 0.7665 0.2335
7 0.8144 0.1856
8 0.8114 0.1886
9 0.8144 0.1856
10 0.7904 0.2096
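The comparison across k values can be sketched with a small hand-written classifier. The Python code below is an illustration only: it uses Euclidean distance and a simple majority vote on made-up, already-normalized points, and it is not the `knn` function from the class library, whose tie-breaking behavior may differ. All names are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority label among the k training points nearest to `point`."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)

# Made-up normalized points, for illustration only.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
           (0.8, 0.9), (0.9, 0.8), (0.85, 0.75)]
train_y = ["Low", "Low", "Low", "High", "High", "High"]
test_X = [(0.2, 0.2), (0.7, 0.8)]
test_y = ["Low", "High"]

for k in (1, 3, 5):
    print(k, accuracy(train_X, train_y, test_X, test_y, k))
```

Odd values of k avoid tied votes in a two-class problem, which is one practical reason to prefer them.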

Choosing the Best k

As seen in the plot and table above, \(k=7\) and \(k=9\) are tied for the highest accuracy (0.8144), so we will use the smaller value, \(k=7\), as the best \(k\).

Confusion Matrix (K = 7)
      High  Low
High   128   23
Low     39  144

KNN Model Performance (K = 7)
Metric      Value
Accuracy    0.8144
Error Rate  0.1856
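The accuracy and error rate follow directly from the confusion matrix: the correct predictions are on the diagonal. A short Python check (the function name is hypothetical):

```python
def accuracy_from_confusion(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

knn_matrix = [[128, 23], [39, 144]]   # counts from the k = 7 confusion matrix
acc = accuracy_from_confusion(knn_matrix)
print(round(acc, 4), round(1 - acc, 4))
```

Here 272 of the 334 test observations are classified correctly.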

Conclusion

Yes, we can classify countries into high vs. low internet usage based on economic indicators such as a country's GDP per capita, total enrollment in tertiary education, how much it spends on healthcare per capita, and the percentage of its population that lives in an urban area. As seen in the analysis, \(k=7\) performs best for this task, with a test accuracy of 0.8144.

Naive Bayes Classification

Research Question

Can we classify countries into high vs. low life expectancy based on development statistics like the percentage of the population that lives in an urban area, how much a country spends on healthcare per capita, and the percentage of the population with access to electricity?

Set Up

For this research question, we will create a binary variable for high and low life expectancy based on if the life expectancy of a country is greater than the median life expectancy from the data. Additionally, we will create a 70-30 training-testing split.
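For readers unfamiliar with the method, the sketch below shows a Gaussian Naive Bayes classifier in plain Python. It assumes each numeric predictor is modeled with a per-class normal distribution, which is a common choice but an assumption here, since the dashboard's implementation is not shown. The function names and the data are made up.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Store the prior, per-feature means, and variances for each class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # A tiny constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                     for c, m in zip(cols, means)]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict(model, row):
    """Pick the class with the largest log prior plus log likelihood."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, s2 in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Made-up rows: (urban %, health spending per capita, electricity access %).
X = [(80, 3000, 100), (75, 2500, 100), (85, 4000, 99),
     (30, 50, 40), (35, 80, 55), (25, 40, 35)]
y = ["High", "High", "High", "Low", "Low", "Low"]

model = fit_gaussian_nb(X, y)
print(predict(model, (78, 2800, 100)), predict(model, (28, 60, 45)))
```

The "naive" part is the assumption that the predictors are independent within each class, which lets the likelihood be computed one feature at a time.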

Naive Bayes Model

Naive Bayes Confusion Matrix (Life Expectancy Class)
      High  Low
High   157   83
Low     10   84

Naive Bayes Model Performance Summary
Metric      Value
Accuracy    0.7216
Error Rate  0.2784


This plot shows how confident the model is in predicting high life expectancy. It does a really good job of predicting high life expectancy countries, but is less confident on low life expectancy ones.

Conclusion

Yes, we can classify countries into high vs. low life expectancy based on the percentage of the population that lives in an urban area, how much a country spends on healthcare per capita, and the percentage of the population with access to electricity. As seen in our analysis above, the Naive Bayes model performs reasonably well on this task, with an accuracy of 0.7216, although 83 of its 93 misclassifications fall in a single off-diagonal cell of the confusion matrix.

Logistic Regression Classification

Research Question

How do unemployment, income inequality, and GDP per capita influence the likelihood of a country being underdeveloped, as measured by low life expectancy?

Set Up

For this research question, we will define underdeveloped countries as those with life expectancy less than 70, and use this to create our classes.

Logistic Regression Model


Call:
glm(formula = low_expectancy ~ ., family = "binomial", data = logit_data)

Coefficients:
                                                                     Estimate
(Intercept)                                                         3.801e+00
`Unemployment, total (% of total labor force) (national estimate)` -1.079e-01
`Gini index`                                                       -5.121e-02
`GDP per capita (current US$)`                                     -6.504e-04
                                                                   Std. Error
(Intercept)                                                         6.835e-01
`Unemployment, total (% of total labor force) (national estimate)`  3.090e-02
`Gini index`                                                        1.552e-02
`GDP per capita (current US$)`                                      7.284e-05
                                                                   z value
(Intercept)                                                          5.562
`Unemployment, total (% of total labor force) (national estimate)`  -3.493
`Gini index`                                                        -3.300
`GDP per capita (current US$)`                                      -8.929
                                                                   Pr(>|z|)    
(Intercept)                                                        2.67e-08 ***
`Unemployment, total (% of total labor force) (national estimate)` 0.000478 ***
`Gini index`                                                       0.000965 ***
`GDP per capita (current US$)`                                      < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 770.24  on 1115  degrees of freedom
Residual deviance: 368.99  on 1112  degrees of freedom
AIC: 376.99

Number of Fisher Scoring iterations: 10
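The fitted coefficients can be turned into a predicted probability with the logistic function. The Python sketch below uses the estimates from the output above and assumes that low_expectancy = 1 marks a life expectancy below 70, which the output does not state explicitly. The function name and the input values are hypothetical.

```python
import math

# Estimates copied from the model output above.
COEF = {
    "intercept": 3.801,
    "unemployment": -0.1079,
    "gini": -0.05121,
    "gdp": -0.0006504,
}

def prob_low_expectancy(unemployment, gini, gdp):
    """Predicted probability of the low life expectancy class."""
    z = (COEF["intercept"]
         + COEF["unemployment"] * unemployment
         + COEF["gini"] * gini
         + COEF["gdp"] * gdp)
    return 1 / (1 + math.exp(-z))

# Hypothetical countries that differ only in GDP per capita.
p_low_gdp = prob_low_expectancy(unemployment=5.0, gini=40.0, gdp=2000.0)
p_high_gdp = prob_low_expectancy(unemployment=5.0, gini=40.0, gdp=20000.0)
print(round(p_low_gdp, 3), p_high_gdp)

# Odds ratio for one additional percentage point of unemployment.
print(round(math.exp(COEF["unemployment"]), 3))
```

Each coefficient is a change in log-odds per unit of the predictor, so exponentiating it gives the multiplicative change in the odds with the other predictors held fixed.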



Conclusion

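When evaluating life expectancy against income inequality and the unemployment rate, the plots show a distinct downward trend: the higher the unemployment rate, the lower the life expectancy. When we also incorporate income inequality and GDP per capita, the curve eventually levels off in the region where life expectancy is below 70 years.

We provided a separate graph to show how life expectancy compares to GDP per capita and our other factors (unemployment and Gini index). The graph suggests that lower GDP, a higher unemployment rate, and greater income inequality go together with a higher probability of a country having a life expectancy below 70 years. In the fitted model, GDP per capita has the clearest effect: its coefficient is negative and highly significant, so wealthier countries are much less likely to fall in the low life expectancy class. The coefficients for unemployment (-0.108) and the Gini index (-0.051) are also negative, which means that once GDP per capita is held constant, higher values of these two variables are associated with a lower estimated probability of low life expectancy (assuming low_expectancy = 1 marks a life expectancy below 70). This difference between the plots and the coefficients deserves further investigation.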